Disclaimer


The information in this presentation refers to specifications still in 
the development process. This presentation reflects the current 
thinking of various PCI-SIG® workgroups, but all material is 
subject to change before the specifications are released. 
PCI-SIG®: An Open Industry Consortium

Organization that defines the PCI Express® (PCIe®) I/O bus
specifications and related form factors

830+ member companies located worldwide

Creating specifications and mechanisms to support
compliance and interoperability

PCI-SIG member companies support the following
usages with PCIe technology:

• Cloud
• Edge
• Automotive
• Artificial intelligence
• Analytics
• Telecommunications
• Consumer
• Mobile
• Data Center

[Figure: Board of Directors 2020-2021 member logos, including Arm, Dell EMC, Intel, Keysight Technologies, NVIDIA, Qualcomm, and Synopsys]
PCIe® Architecture Layering for Modularity and Reuse

Software
• PCI compatibility, configuration, driver model
• PCIe architecture enhanced configuration model

Transaction
• Split-transaction, packet-based protocol
• Credit-based flow control, virtual channels

Data Link
• Logical connection between devices
• Reliable data transport services (CRC, Retry, Ack/Nak)

Logical PHY
• Physical information exchange
• Interface initialization and maintenance

Mechanical
• Market segment specific form factors
• Evolutionary and revolutionary


PCIe®: One Base Specification, Multiple Form Factors

[Figure: photos of PCIe form factors. Source: Intel Corporation]

• 16x20 mm: small and thin platforms
• M.2 (42, 80, and 110mm): smallest footprint of PCIe connector form factors; use for boot or for max storage density
• U.2 (aka SFF-8639): majority of SSDs sold; ease of deployment, hotplug, serviceability; Single-Port x4 or Dual-Port x2
• Add-in-card (AIC): maximum system compatibility with existing servers and most reliable compliance program; higher power envelope, and options for height and length
• SFF TA 1006 (SSD) and SFF TA 1002; up to 36 modules
• Other captions: high B/W with PCIe 3.0; prevalent in hand-held, IoT, automotive

Multiple form factors from the same silicon to meet the needs of different segments


Evolution of PCI Express® Specifications

[Figure: data rate by generation: PCIe 1.0 @ 2.5GT/s, PCIe 2.0 @ 5.0GT/s, PCIe 3.0 @ 8.0GT/s, PCIe 4.0 @ 16GT/s, PCIe 5.0 @ 32GT/s]

• PCIe® architecture doubles the data rate every generation, about every 3 years, with full backward compatibility
• Ubiquitous I/O across the compute continuum: PC, Hand-held, Workstation, Server, Cloud, Enterprise, HPC, Embedded, IoT, Automotive, AI
• One stack / same silicon across all segments with different form-factors, widths (x1/ x2/ x4/ x8/ x16) and data rates: e.g., a x16 PCIe 5.0 specification interoperates with a x1 PCIe 1.0 specification!

PCIe            Data Rate in GT/s      x16 B/W         Year
Specification   (Encoding)             per dirn**
1.0             2.5 (8b/10b)           32 Gb/s         2003
2.0             5.0 (8b/10b)           64 Gb/s         2007
3.0             8.0 (128b/130b)        126 Gb/s        2010
4.0             16.0 (128b/130b)       252 Gb/s        2017
5.0             32.0 (128b/130b)       504 Gb/s        2019
6.0 (WIP)       64.0 (PAM-4, FLIT)     1024 Gb/s       2021*
                                       (~1Tb/s)

* Projected   ** Bandwidth after encoding overhead
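The x16 bandwidth column can be reproduced with a short calculation. The following is a minimal sketch, not part of the presentation: the function name and argument layout are illustrative assumptions, and the encoding efficiencies (8/10 and 128/130) come from the encodings named in the table. The 6.0 row is left out because the table lists its raw figure (64 x 16 = 1024 Gb/s).

```python
def x16_bandwidth_gbps(rate_gt_s, payload_bits, coded_bits, lanes=16):
    """Per-direction bandwidth in Gb/s after line-encoding overhead.

    rate_gt_s    -- per-Lane data rate in GT/s
    payload_bits -- data bits per code group (8 for 8b/10b, 128 for 128b/130b)
    coded_bits   -- transmitted bits per code group (10 or 130)
    """
    return rate_gt_s * lanes * payload_bits / coded_bits


# Rows of the table that apply an encoding overhead.
rows = {
    "1.0": (2.5, 8, 10),
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
    "4.0": (16.0, 128, 130),
    "5.0": (32.0, 128, 130),
}

for spec, (rate, payload, coded) in rows.items():
    print(spec, round(x16_bandwidth_gbps(rate, payload, coded)), "Gb/s")
```

Rounded to whole Gb/s, the results line up with the table entries for PCIe 1.0 through 5.0.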
Bandwidth Drivers for PCIe® 6.0 Specification

[Figure: market segments driving bandwidth demand: Automotive, Artificial Intelligence, Enterprise Servers, PC/Mobile/IoT, and Storage. Attributes called out across the segments include high performance, high bandwidth, reliability, availability, serviceability, redundancy/failover, power efficiency, power savings, low latency, scalable architecture, increased performance, reduced TCO, faster data transfer, better user experience, and ubiquity.]


(New Usage Models: Cloud, AI/ Analytics, Edge)

• Device side: Networking (800G in early 2020s), Accelerators, FPGA/ ASICs, Memory
• Alternate Protocols on PCIe technology
• As the per socket compute capability grows at an exponential pace, so do the I/O needs. We have already added a lot of Lanes per socket (currently 128 Lanes) => speed has to go up
• But ... we need to meet the cost, performance, and power metrics as a ubiquitous I/O with hundreds of Lanes in a platform

New usage models are driving bandwidth demand, which doubles every three years
Key Metrics for PCIe 6.0 Specification: Requirements

Data Rate: 64 GT/s, PAM4 (double the bandwidth per pin every generation)
Latency: <10ns adder for Transmitter + Receiver over 32.0 GT/s (including FEC)
    (We cannot afford the 100ns FEC latency as networking does with PAM-4)
Bandwidth Inefficiency: <2% adder over PCIe 5.0 across all payload sizes
Reliability: 0 < FIT << 1 for a x16 (FIT: Failure in Time, number of failures in 10^9 hours)
Channel Reach: Similar to PCIe 5.0 specification under similar set up for Retimer(s) (maximum 2)
Power Efficiency: Better than PCIe 5.0 specification
Low Power: Similar entry / exit latency for L1 low-power state
    Addition of a new power state (L0p) to support scalable power consumption with
    bandwidth usage without interrupting traffic
Plug and Play: Fully backwards compatible with PCIe 1.x through PCIe 5.0

HVM-ready, cost-effective, scalable to hundreds of Lanes in a platform

Need to make the right trade-offs to meet each of these metrics!
PAM4 Signaling at 64.0 GT/s

• PAM4 signaling: Pulse Amplitude Modulation 4-level
  • 4 levels (2 bits) encoded in same Unit Interval (UI)
  • 3 eyes
  • Helps channel loss (same Nyquist as 32.0 GT/s)
• Reduced voltage levels (EH) and eye width increase susceptibility to errors: 3 eyes in same UI
• Gray Coding to help minimize errors in UI
• Precoding to minimize errors in a burst
• Voltage levels at Tx and Rx define encoding

[Figure: PAM4 eye diagram with voltage levels 0 to 3, their 2-bit encodings, and DC balance values from -3 to +3]

Encoding per    Tx            Rx Voltage (V)
UI (2 bit)      Voltage
00              -Vtx          V <= Vth1
01              -Vtx/3        Vth1 < V <= Vth2
11              +Vtx/3        Vth2 < V <= Vth3
10              +Vtx          V > Vth3
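The Tx/Rx encoding table can be exercised with a small model. This is a sketch under stated assumptions, not the specification's definition: the slide only names the thresholds Vth1 to Vth3, so placing them midway between adjacent Tx levels is an assumption, as are the function names and the normalised Vtx of 1.0.

```python
# Gray-coded 2-bit symbols for voltage levels 0..3, following the Tx table:
# 00 -> -Vtx, 01 -> -Vtx/3, 11 -> +Vtx/3, 10 -> +Vtx.
LEVEL_TO_BITS = ["00", "01", "11", "10"]
BITS_TO_LEVEL = {bits: level for level, bits in enumerate(LEVEL_TO_BITS)}


def tx_voltage(bits, vtx=1.0):
    """Transmit voltage for a 2-bit symbol (vtx normalised to 1.0 by assumption)."""
    level = BITS_TO_LEVEL[bits]
    return vtx * (2 * level - 3) / 3


def rx_decode(voltage, thresholds=(-2 / 3, 0.0, 2 / 3)):
    """Slice a received voltage into a 2-bit symbol.

    The thresholds stand in for Vth1..Vth3; midpoints between Tx levels are
    an assumption made for this example.
    """
    level = sum(voltage > vth for vth in thresholds)
    return LEVEL_TO_BITS[level]


def bit_errors(sent, received):
    """Number of differing bits between two 2-bit symbols."""
    return sum(a != b for a, b in zip(sent, received))


# A slicing error into an adjacent level costs exactly one bit with Gray coding.
for level in range(3):
    assert bit_errors(LEVEL_TO_BITS[level], LEVEL_TO_BITS[level + 1]) == 1
```

With this mapping, noise that pushes a symbol into a neighbouring eye corrupts one bit rather than two, which is the point of the Gray Coding bullet.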
Error Assumptions and Characteristics w/ PAM-4

Parameters of interest: FBER and error correlation within Lane and across Lanes

• FBER: First bit error rate
  • Probability of the first bit error occurring at the Receiver
  • Receiving Lane may see a burst propagated due to DFE
  • The number of errors from the burst can be minimized
    • Constrain DFE tap weights: balance TxEQ, CTLE and DFE equalization
• Correlation of errors across Lanes
  • Due to common source of errors (e.g., power supply noise)
  • Conditional probability that a first error in a Lane => errors in nearby Lanes
• BER depends on the FBER and the error correlation in a Lane and across Lanes

[Figure: first error (FBER), burst error in a Lane, and Lane correlation]
Handling Errors and Metrics Used for Evaluation

• Two mechanisms to correct errors
  • Correction through FEC (Forward Error Correction)
    • Latency and complexity increase exponentially with the number of Symbols corrected
  • Detection of errors by CRC => Link Level Retry (a strength of PCIe architecture)
    • Detection is linear: latency, complexity and bandwidth overheads
    • Need a robust CRC to keep FIT << 1 (FIT: Failure in Time, number of failures in 10^9 hours)
• Metrics: Prob of Retry (or b/w loss due to retry) and FIT
• Need to use both means of correction to achieve:
  • Low latency and complexity
  • Retry probability at acceptable level (no noticeable performance impact)
  • Low Bandwidth overhead due to FEC, CRC, and retry
Our Approach: Light-weight FEC and Retry

• Light-weight FEC, strong CRC, and keep the overall latency (including retry) really low so that the Ld/St applications do not suffer latency penalty
• We are better off retrying a packet with 10^-6 (or 10^-5) probability with a retry latency of 100ns vs having a FEC latency impact of 100ns with a much lower retry probability

[Figure: Metrics vs raw burst error probability. X-axis: Burst Probability; Y-axis: Probability and BER. Series: Retry Prob/ flit, Effective BER, and FIT, each for Single Symbol Correct and Double Symbol Correct.]

Low latency mechanism w/ FBER of 1E-6 to meet the metrics (latency, area, power, bandwidth)
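The trade-off between a light FEC with occasional retry and a heavy FEC can be made concrete with an expected-latency estimate. This is an illustrative sketch only. The 100 ns figures come from the slide. The 2 ns light-FEC latency is an assumption based on the "~1-2 ns" FEC + CRC estimate given elsewhere in the deck. The 1e-5 retry probability is an assumed example value, and the heavy FEC is idealised as never retrying.

```python
def expected_added_latency_ns(fec_latency_ns, retry_prob, retry_latency_ns):
    """Average latency added per packet: fixed FEC cost plus the expected retry cost."""
    return fec_latency_ns + retry_prob * retry_latency_ns


# Light-weight FEC: small fixed cost, occasional 100 ns retry (values assumed above).
light = expected_added_latency_ns(fec_latency_ns=2.0, retry_prob=1e-5,
                                  retry_latency_ns=100.0)

# Heavy FEC: 100 ns paid on every packet, retries assumed negligible.
heavy = expected_added_latency_ns(fec_latency_ns=100.0, retry_prob=0.0,
                                  retry_latency_ns=100.0)

print(f"light FEC + retry: {light:.3f} ns, heavy FEC: {heavy:.1f} ns")
```

Under these assumptions the retry term contributes about a thousandth of a nanosecond on average, so the fixed FEC latency dominates the comparison.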
FLIT Encoding PCIe 6.0 Specification: Low-latency with High Efficiency

[Figure: FLIT byte layout showing the TLP bytes]

• FLIT (flow control unit) based: FEC needs fixed set of bytes
• Correction in FLIT => CRC (detection) in FLITs => Retry at FLIT level
• Lower data rates will also use the same FLIT once enabled
• FLIT size: 256B
  • 236B TLP, 6B DLP, 8B CRC, 6B FEC
  • No Sync hdr, no Framing Token (TLP reformat), no T(DL)LP CRC
  • Improved bandwidth utilization due to overhead amortization
• FLIT Latency: 2 ns x16, 4 ns x8, 8 ns x4, 16 ns x2, 32 ns x1
• Guaranteed Ack and credit exchange => low latency, low storage
• Optimization: Retry error FLIT only with existing Go-Back-N retry

Low latency improves performance and reduces area
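The FLIT latency figures and the FLIT overhead follow directly from the 256B size and the field breakdown. A minimal sketch, with illustrative names that are not taken from the specification:

```python
FLIT_BYTES = 256
# Field sizes in bytes, as listed on the slide.
FLIT_FIELDS = {"TLP": 236, "DLP": 6, "CRC": 8, "FEC": 6}


def flit_latency_ns(width, rate_gt_s=64.0):
    """Time to transfer one FLIT over a Link of the given width.

    PAM4 carries 2 bits per UI, so 64.0 GT/s corresponds to 64 Gb/s per Lane;
    Gb/s is the same as bits per ns.
    """
    return FLIT_BYTES * 8 / (rate_gt_s * width)


def flit_overhead():
    """Fraction of each FLIT that is not TLP payload (DLP + CRC + FEC)."""
    return 1 - FLIT_FIELDS["TLP"] / FLIT_BYTES


for width in (16, 8, 4, 2, 1):
    print(f"x{width}: {flit_latency_ns(width):g} ns")
print(f"overhead: {flit_overhead():.1%}")
```

The non-TLP bytes come to 20 of 256, or roughly 8% of each FLIT.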
Replay in FLIT Mode

• Once in FLIT mode, we are always in FLIT mode even when the data rate degrades to an NRZ data rate (e.g., 2.5 GT/s, 5.0 GT/s, 8.0 GT/s, 16.0 GT/s, 32.0 GT/s)
• FLIT with NOP-only TLPs not replayed unless the subsequent FLIT also had an uncorrectable error
• On a replay, the Transmitter can choose to skip over the NOP-only TLP FLITs
• All replayed FLITs have the Replay Cmd = 11b (w/ Tx sequence number sent)

[Figure: FLIT sequence timeline with two replay examples]
• NAK: replay only FLIT 13 (Replay Cmd = 10b)
• NAK: replay from FLIT 19 (Replay Cmd = 01b)
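The two NAK cases in the figure can be sketched as a toy transmitter-side selection routine. Everything here is a hypothetical illustration, not the specification's replay logic: the `Flit` class, the `flits_to_replay` function, and the buffer model are invented for this example, and sequence-number wraparound is ignored. Only the Replay Cmd values (10b for a single FLIT, 01b for replay from a FLIT onward) and the option to skip NOP-only FLITs come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    seq: int                # transmit sequence number (no wraparound in this toy model)
    nop_only: bool = False  # True if the FLIT carries only NOP TLPs


def flits_to_replay(tx_buffer, replay_cmd, seq, skip_nop=True):
    """Return the sequence numbers the Transmitter would resend for a NAK.

    tx_buffer  -- unacknowledged FLITs in transmit order (hypothetical model)
    replay_cmd -- 0b10: replay only FLIT `seq`; 0b01: replay from FLIT `seq` onward
    skip_nop   -- optionally skip NOP-only FLITs, as the slide allows
    """
    if replay_cmd == 0b10:
        selected = [f for f in tx_buffer if f.seq == seq]
    elif replay_cmd == 0b01:
        selected = [f for f in tx_buffer if f.seq >= seq]
    else:
        raise ValueError("unsupported Replay Cmd in this sketch")
    if skip_nop:
        selected = [f for f in selected if not f.nop_only]
    return [f.seq for f in selected]


# FLITs 12..21 are outstanding; FLIT 20 carries only NOP TLPs.
buffer = [Flit(seq=n, nop_only=(n == 20)) for n in range(12, 22)]
print(flits_to_replay(buffer, 0b10, 13))  # single-FLIT replay
print(flits_to_replay(buffer, 0b01, 19))  # replay from FLIT 19, skipping the NOP-only FLIT
```

Replaying a single FLIT where possible is what keeps the bandwidth cost of a retry well below that of a full Go-Back-N replay.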
Retry Probability and FIT vs. FBER/ Correlation

• Single Symbol Correct interleaved FEC plus 64-b CRC works really well for raw FBER of 1E-6 even with high Lane correlation
  • Retry probability per FLIT is 5 x 10^-6
  • B/W loss is 0.05% even with go-back-n
  • FIT is almost 0
• Can mitigate the bandwidth loss significantly by retrying only the non-NOP TLP FLITs
Retry Time (ns)                                  200
Raw Burst Error Probability                      1.00E-04       1.00E-05       1.00E-06       1.00E-07
Correlation second Lanes                         1.00E-03       1.00E-03       1.00E-04       1.00E-05
Width of Link                                    16             16             16             16
Frequency (GT/s)                                 64             64             64             64
Bits per Flit/ lane                              128            128            128            128
Prob 0 error/ Lane (no correlation Lanes)        0.98728094     0.998720812    0.999872008    0.9999872
Prob 1 error/ Lane (no correlation Lanes)        0.01263846     0.001278375    0.000127984    1.28E-05
Prob 2 errors/ Lane (no correlation Lanes)       8.02622E-05    8.11777E-07    8.12698E-09    8.1279E-11
Prob 3 errors/ Lane (no correlation Lanes)       3.37135E-07    3.4095E-10     3.41333E-13    3.4137E-16
Prob 4 errors/ Lane (no correlation Lanes)       1.05365E-09    1.06548E-13    1.06667E-17    1.0668E-21
Prob 0 errors in FLIT (w/ Lane correlation)      0.814801918    0.979728191    0.997954095    0.99979522
Prob 1 errors in FLIT (w/ Lane correlation)      0.165450705    0.019778713    0.002040878    0.00020473
Prob 2 errors in FLIT (w/ Lane correlation)      0.018486407    0.000487166    5.02119E-06    5.0364E-08
Prob 3 errors in FLIT (w/ Lane correlation)      0.001203308    4.02153E-06    4.11326E-09    4.1225E-12
Prob 4 errors in FLIT (w/ Lane correlation)      5.44278E-05    4.59176E-08    4.7216E-12     4.7348E-16
Prob 0 errors all Lanes/ FLIT (w/ correlation)   0.814801918    0.979728191    0.997954095    0.99979522
Prob of 1 error all Lanes/ FLIT                  0.164402247    0.019766156    0.002040748    0.00020473
Retry Prob/ FLIT (>1 error in all Lanes/ FLIT)   0.019747377    0.000493       5.02723E-06    5.037E-08
Number of FLITs over retry window                100            100            100            100
0 uncorrected FLIT errors over retry window      0.136082199    0.951874769    0.9994974      0.99999496
1 uncorrected FLIT errors over retry window      0.274140195    0.046959754    0.000502475    5.037E-06
Retry prob over Retry time                       0.863917801    0.048125231    0.0005026      5.037E-06
Time per FLIT (ns)                               2              2              2              2
Flits per sec                                    500000000      500000000      500000000      500000000
Flits per 1E9 hrs                                1.8E+21        1.8E+21        1.8E+21        1.8E+21
CRC bits                                         64             64             64             64
Aliasing Prob                                    5.42101E-20    5.42101E-20    5.42101E-20    5.421E-20
SDC/ FLIT                                        2.95054E-24    2.4892E-27     2.55959E-31    2.5667E-35
FIT (Failure in Time)                            0.005310966    4.48056E-06    4.60726E-10    4.6201E-14
Effective BER (Single Symbol Correct)            6.17004E-05    1.5351[...]    [...]E-08      1.574E-10
Effective BER (Double Symbol Correct)            3.93042E-06    1.27108E-08    1.28687E-11
Effective BER (Triple Symbol Correct)            1.70087E-07    1.43493E-10    1.4[...]
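Several rows of this table can be recomputed from the others. The sketch below is an inference from the numbers rather than a method stated in the presentation. It assumes that SDC per FLIT equals the "Prob 4 errors in FLIT" row multiplied by the CRC aliasing probability 2^-64, that FIT scales this by the number of FLITs in 10^9 hours, and that the retry probability over the window is the chance of at least one retry in 100 FLITs. Function names are illustrative.

```python
def fit_from_table(prob_escaping_fec, crc_bits=64, time_per_flit_ns=2.0):
    """FIT (failures per 1E9 hours) from the probability of an error pattern
    that the FEC does not correct and the CRC then has to detect.

    Assumed relation: FIT = P(pattern) * 2^-crc_bits * FLITs per 1E9 hours.
    """
    aliasing_prob = 2.0 ** -crc_bits                # about 5.42101E-20 for 64 bits
    sdc_per_flit = prob_escaping_fec * aliasing_prob
    flits_per_sec = 1e9 / time_per_flit_ns          # 5E8 at 2 ns per FLIT
    flits_per_1e9_hours = flits_per_sec * 3600 * 1e9
    return sdc_per_flit * flits_per_1e9_hours


def retry_prob_over_window(retry_prob_per_flit, flits_in_window=100):
    """Probability of at least one retry among the FLITs in the retry window."""
    return 1.0 - (1.0 - retry_prob_per_flit) ** flits_in_window


# Column for a raw burst error probability of 1.00E-06.
print(fit_from_table(4.7216e-12))           # compare with the FIT row
print(retry_prob_over_window(5.02723e-06))  # compare with "Retry prob over Retry time"
```

These relations reproduce the table's FIT and retry-window rows to within rounding, which is also how the cells garbled in extraction (SDC/ FLIT and Retry Prob/ FLIT) were cross-checked.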
PCIe 6.0 FLIT Mode Bandwidth at 64.0 GT/s

• Bandwidth increase = 2X (BW efficiency of FLIT mode) / (BW efficiency in non-FLIT mode)
• Overall we see a >2X improvement in bandwidth (benefits most systems)
• Efficiency gain reduces as TLP size increases
  • Beyond 512 B (128 DW) payload, the efficiency ratio goes below 1
• Bandwidth efficiency improvement in FLIT mode is due to the amortization of CRC, DLP, and ECC over a FLIT (8% overhead), which works out better than the sync hdr, DLLP, Framing Token per TLP, and 4B CRC per TLP overheads in PCIe 5.0

[Figure: Bandwidth Scaling with PCIe 6.0 at 64.0 GT/s over PCIe 5.0 at 32.0 GT/s w/ 2% DLLP overhead. X-axis: data payload size (DW); series include Read-Write.]

Bandwidth Efficiency improvement causes > 2X bandwidth gain for up to 512B Payload in 64.0 GT/s FLIT mode
Latency Impact of FLIT Mode

• FLIT accumulation in Rx only (Tx pipeline)
• FEC + CRC delay expected to be ~1-2 ns
• Expected latency savings due to removal of sync hdr and fixed FLIT sizes (no framing logic, no variable sized TLP/ CRC processing) are not considered in the tables here
• With twice the data rate and the above optimizations, realistically expect to see lower latency except for x2 and x1 for smaller payload TLPs: worst case ~10ns adder

x1 Link
Data Size   TLP Size   Latency in ns   Latency in ns   Latency increase due
(DW)        (DW)       @ 32.0 GT/s     @ 64.0 GT/s     to accumulation (ns)
0           4          6.09375         18              11.90625
4           8          10.15625        20              9.84375
8           12         14.21875        22              7.78125
16          20         22.34375        26              3.65625
32          36         38.59375        34              -4.59375
64          68         71.09375        50              -21.09375
128         132        136.09375       82              -54.09375
256         260        266.09375       146             -120.09375
512         516        526.09375       274             -252.09375
1024        1028       1046.09375      530             -516.09375

x16 Link
Data Size   TLP Size   Latency in ns   Latency in ns   Latency increase due
(DW)        (DW)       @ 32.0 GT/s     @ 64.0 GT/s     to accumulation (ns)
0           4          0.380859375     1.125           0.744140625
4           8          0.634765625     1.25            0.615234375
8           12         0.888671875     1.375           0.486328125
16          20         1.396484375     1.625           0.228515625
32          36         2.412109375     2.125           -0.287109375
64          68         4.443359375     3.125           -1.318359375
128         132        8.505859375     5.125           -3.380859375
256         260        16.63085938     9.125           -7.505859375
512         516        32.88085938     17.125          -15.75585938
1024        1028       65.38085938     33.125          -32.25585938

Meets or exceeds the latency expectations
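The latency figures are consistent with a simple serialization model, sketched below. The model is inferred from the numbers and is not stated in the presentation. At 32.0 GT/s it assumes the TLP plus 8 bytes of per-TLP overhead (taken to be framing and CRC) sent with 128b/130b encoding. At 64.0 GT/s it assumes the TLP transfer time plus half of one 256B FLIT time as the accumulation delay. The function name is illustrative.

```python
def tlp_latency_ns(tlp_dw, width):
    """Return (latency at 32.0 GT/s, latency at 64.0 GT/s, increase) in ns.

    tlp_dw -- TLP size in DW (4 bytes each), as in the "TLP Size" column
    width  -- Link width in Lanes
    """
    tlp_bytes = 4 * tlp_dw

    # Non-FLIT mode at 32.0 GT/s: 8 bytes of per-TLP overhead (assumed),
    # then 130 transmitted bits for every 128 data bits.
    bits_on_wire = (tlp_bytes + 8) * 8 * (130 / 128)
    latency_32 = bits_on_wire / (32.0 * width)

    # FLIT mode at 64.0 GT/s: TLP transfer time plus half a FLIT time (assumed).
    flit_time = 256 * 8 / (64.0 * width)
    latency_64 = tlp_bytes * 8 / (64.0 * width) + flit_time / 2

    return latency_32, latency_64, latency_64 - latency_32


for width in (1, 16):
    for tlp_dw in (4, 36, 1028):
        l32, l64, delta = tlp_latency_ns(tlp_dw, width)
        print(f"x{width} TLP {tlp_dw} DW: {l32:.6f} ns, {l64:.3f} ns, {delta:+.6f} ns")
```

Under this model the increase is largest for the smallest TLP on a x1 Link, about 11.9 ns, in line with the "worst case ~10ns adder" noted for narrow Links.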
Motivation for a New Low Power State

• Existing low-power states: L0s, L1, Dynamic Link Width (DLW), Speed Change
  • Served well for the set of usages so far and will continue
• Increasingly there is demand for power consumption scaling with bandwidth usage without impacting traffic flow
• Solution: New state L0p (symmetric)
  • Maintain at least one active Lane; the active Lanes continue to carry traffic. Link still carries traffic during L0p width transition
  • Expect L0p PHY power savings similar to turning off power for the idle Lanes

[Figure: L0p handshake between DSP and USP on a x16 Link]
• L0p Support Query (L0p Enabled)
• L0p Request x8, followed by L0p Request Ack (Handshake: Lanes 8-15 go electrically idle while traffic flows in Lanes 0-7)
• L0p Request (Handshake: Lanes 8-15 retrain while traffic flows in Lanes 0-7; eventually Lanes 8-15 merge with Lanes 0-7 to carry traffic)
• 16 active Lanes
Key Metrics for PCIe 6.0 Specification: Evaluation Based on Current Trend

Data Rate
    Expectation: 64 GT/s, PAM4 (double the bandwidth per pin every generation)
    Evaluation (Trend): Meets (must do)
Latency
    Expectation: <10ns adder for Transmitter + Receiver over 32.0 GT/s (including FEC)
    (We cannot afford the 100ns FEC latency as n/w does with PAM-4)
    Evaluation (Trend): Exceeds (savings in latency, with <10ns for x1/ x2 cases)
Bandwidth Inefficiency
    Expectation: <2% adder over PCIe 5.0 across all payload sizes
    Evaluation (Trend): Exceeds (getting >2X bandwidth in most cases)
Reliability
    Expectation: 0 < FIT << 1 for a x16 (FIT: Failure in Time, failures in 10^9 hours)
    Evaluation (Trend): Meets
Channel Reach
    Expectation: Similar to PCIe 5.0 specification under similar set up for Retimer(s) (maximum 2)
    Evaluation (Trend): Meets
Power Efficiency
    Expectation: Better than PCIe 5.0 specification
    Evaluation (Trend): Design dependent; expected to meet
Low Power
    Expectation: Similar entry/ exit latency for L1 low-power state. Addition of a new power state (L0p) to support scalable power consumption with bandwidth usage without interrupting traffic
    Evaluation (Trend): Design dependent; expected to meet; L0p looks promising
Plug and Play
    Expectation: Fully backwards compatible with PCIe 1.x through PCIe 5.0
    Evaluation (Trend): Meets
HVM-ready, cost-effective, scalable to hundreds of Lanes in a platform
    Evaluation (Trend): Expected to meet

On track to meet or exceed requirements on all key metrics
Conclusions and Call to Action

• PCIe® 6.0 specification is at Rev 0.5 level; Rev 0.7 is in progress
• Very challenging on multiple fronts
  • New signaling with PAM4: tradeoff around errors/ correlation, channels, performance/ area, and circuit complexity to double the bandwidth
  • Metrics (latency, bandwidth efficiency, area, cost, power) which are significantly more challenging than what other standards have done with PAM4 at lower speeds
    • e.g., 100+ ns FEC latency on other standards vs our single digit ns latency targets; 12+% bandwidth inefficiency in other standards vs <2% inefficiency targets for us
• We are on track to exceed or meet the requirements
  • Need to continue to do due diligence through analysis, simulations, and test silicon characterization to ensure we have a robust specification
• We have the combined innovation capability of 830+ members with a track record of delivering flawlessly against challenges for more than two decades. We will deliver this time also!!

Consider joining PCI-SIG® if you have not done so; be a part of this exciting journey!


6/9/2020 Copyright © 2020 PCI-SIG. All Rights Reserved 


Questions

Thank you for attending the PCI-SIG
Q2 2020 Webinar


For more information please go to 
www.pcisig.com 


